OpenML <> Scikit-learn Hackathon
Logistics
Location
Montparnasse Tower, 33 Avenue du Maine, West Entrance (on your left when leaving the train station), 27th Floor, ring at “Probabl”
Contact Persons
+33783822597 (François Goupil)
+33760407677 (Charlène Bizollon)
Communication Channel
Please join the OpenML Slack workspace and the dedicated hackathon channel for easy communication and updates.
Slack: https://join.slack.com/t/openml/shared_invite/zt-2ktk2cj1c-r637o20pfCc0H7PS8OUGtA
Channel: conference
Wifi
SSID: :probabl.guest
PWD: :probabl.
Note: At some point, be ready to use your own mobile data (we are experiencing some difficulties).
SSID: :probabl.eiffel-2.4
PWD: :probabl.
Note: low bandwidth
Schedule
June 24 - 09:00-18:00
09:00-09:30 | Welcome | Coffee and Croissants |
09:30-10:30 | Introduction | - Short presentation of the scikit-learn and OpenML projects + Probabl (Joaquin and Pieter for OpenML, Guillaume Lemaitre for scikit-learn, Yann Lechelle for Probabl) - Quick round where everyone introduces themselves. - Plan break-out sessions or suggest new ones. |
10:30-11:30 | Breakout | (1) Organizing community events and Onboarding Contributors (Maren) |
11:30-13:00 | Lunch | Bouillon Chartier |
13:00-14:00 | Breakout | (2) Governance, Funding and Sponsorship (Adrin + François) |
14:00-15:00 | Breakout | (3) Future Collaboration between scikit-learn and OpenML (Guillaume) |
15:00-18:00 | Code | Code: explore each other’s projects. |
After the official programme each day, there is a suggested bar and restaurant to go to:
Bar + Restaurant: Le Falstaff
June 25 - 09:00-18:00
09:00-10:00 | Coffee/croissants | Croissant Talk by Joaquin |
10:00-11:00 | Breakout | (6) Collaboration Ecosystem for Open-Source Machine Learning |
11:00-12:00 | Breakout | (7) Academic and Industrial Scope of OpenML and Probabl in AI - Collaboration |
12:00-13:00 | Lunch | TranTranZai |
13:00-14:00 | Breakout | (5) Probabl Product Technical Discussion (Camille) |
14:00-17:00 | Coding | Coding |
17:00-18:00 | Breakout | (4) Development Tooling and Workflows |
Bar + Restaurant: Food Society Paris
June 26 - 09:00-13:00
09:00-10:00 | Breakout + coffee/croissants | Joaquin et al.: we have a bit of a delay checking out of our Airbnb but will be there shortly. |
10:00-12:00 | TBD | Open / Ad-hoc |
12:00-13:00 | Wrap-up | |
13:00-14:00 | Lunch + end of the hackathon | Subway in Montparnasse |
Breakout Sessions Ideas
This document contains the preliminary agenda and suggestions for the OpenML <> scikit-learn Hackathon, Paris ’24. Breakout sessions are discussions where we can brainstorm or exchange experiences on specific topics. Feel free to propose additional sessions.
💡 Feel free to add new session topics below, there is a template at the end.
1. Organizing Community Events and Onboarding Contributors [Day 1]
leader: Maren Westermann
description:
- Share our experiences organizing hackathons. How do you attract attendees? How do you make sure that the work at a hackathon is fruitful? Where should you organize your hackathons, and how are they funded? Are online open-source sprints/hackathons an option for you?
- What process and documentation should be in place to help onboard new contributors? How to get them started effectively, and how do you make sure they stay with the project?
- How do we get our projects known to users?
notes:
- community sprints: everyone is invited to take part, in particular newcomers, to contribute to an open-source project
- Have a list of curated issues
  - Start with documentation issues for new contributors
  - Come up with issues before the sprint; first-time contributors need a curated list (especially beginner-friendly ones)
  - meta-issues for a group of related issues: once one is fixed, it can serve as a contribution template
  - documentation issues: people start by reading the documentation around the issues
- documentation about how to contribute was lacking and too long
- rewrote the contributors’ guide to be more concise and more beginner-friendly, plus some video tutorials to get started with GitHub-based contributions
- keeping beginner issues reserved for the sprints (so other people don’t jump on them beforehand)
- differences between OpenML hackathons and scikit-learn sprints:
  - OpenML hackathons are one week long and bigger
  - Core-developer sprints vs. new-contributor sprints
  - OpenML has 7 repositories with different programming languages (backend, APIs, frontend, …)
  - Some documentation for first-time contributors, but not exhaustive
  - One-to-one mentoring to get started with a working dev setup
    - Typically requires several days’ investment
- onboarding new contributors online and in person are two different processes
  - Hard to make people feel connected and stay long-term
- The social aspect is important: organizing recurrent events (every few months) to develop more long-term engagement
- Joint PyLadies Paris / scikit-learn core contributor events:
  - near one-to-one mentoring
  - recurrent, every few months
  - a few hours in the evening
  - retention is low, but the events allowed people to develop social bonds
  - Important to have maintainers present to slowly build a connection.
- Retention is low, so be realistic about expectations, but the outliers are what matters
- Personal connection: people are more likely to contribute if they know you
- OpenML hackathons are useful for core maintainers to secure several solid hours in a row to contribute to the project.
  - 0 full-time contributors
  - part-time engineers for academic projects
  - nice locations thanks to EU funding
- Can OpenML use students to help contribute features?
  - Hard to get high-quality submissions
- How to incentivize people?
  - Career building: showing off a scikit-learn contribution on a resume (less long-term contribution)
  - Sense of community
  - Useful for your own research
  - Hard to have ‘flashy results’ (e.g. genAI apps); how can we solve that?
- How to scale the time investment?
  - PyLadies sprints: only a few hours in the evening (6-9pm)
  - Every 2 months, 15-30 (capped) people show up
- How are sprints structured?
  - Pre-sprint: online, so people have the right setup
  - At least one organizer (e.g. Maren for PyLadies, supported by core devs)
  - Shortlist of issues for each sprint
- Paid internships: a great way to find good people and build
  - Requires funding
  - Mentoring takes time, but helps people take over some tasks
- Slack discussions
- Generative AI? Bad quality, wastes time. scikit-learn only allows human contributions.
- How to focus attention? E.g. a key project this quarter?
2. Governance, Funding and Sponsorship
leader: Adrin + François
description: What are our experiences with our governance structures? What are opportunities for open-source projects to make money to pay for, e.g., server costs, organizing events, and so on? How do we argue the importance of our projects to motivate a funder/sponsor? Can we quantify our contribution?
notes:
- Governance is a living document. It matters for the community.
- How to combine an open-source library with a for-profit company:
  - Be clear about which parts are community-“owned” and which parts are company-owned
  - Keep discussions of the community aspects on the community Slack. Decision processes must remain open.
  - write public versions of important decisions
  - Still creates confusion (for users and contributors)
  - Keep people informed through mailing lists, monthly meetings, etc. (This takes a long time, but is worth it in the long run)
  - communicate upcoming discussions on the mailing list
- contributing to scikit-learn opens doors to work at companies
- Inria foundation:
  - 50k EUR for 2 meetings yearly where the company can state priorities
  - No obligation; taken up only if useful for the community
  - Is it sustainable? For academic salaries. As long as scikit-learn stays useful, companies will keep doing this. Requires that someone at the sponsoring company cares that scikit-learn doesn’t decline.
  - Also advertising (logo on the website)
  - Modelled on the Linux Foundation.
- Probabl:
  - Will put an RE (research engineer) to work on a certain issue, but for a lot more money
  - Companies prefer this (they don’t know how to hire a good scikit-learn engineer)
- Leadership by effort:
  - Put your own time into the aspects that are important to you
  - NVIDIA: have people at other companies work on scikit-learn
  - How much effort?
- For projects that need faster cycles, create a separate package (e.g. skops, hazardous, …), but this creates maintenance work.
- Hard to get new reviews in since the process takes so long. Multiple rounds of reviews even for a simple spelling error.
- Library for putting scikit-learn models into production → skops
- Having a documentation lead, community interaction lead, etc.: does it help, and how much?
  - Lowers the bar (core dev is a really high bar)
  - Speeds up decisions
  - Need more people on the different teams
3. Future Collaboration between Scikit-learn and OpenML [Day 1, 2 sessions]
leader: Guillaume Lemaitre
description: scikit-learn can fetch datasets from OpenML, and users can automatically evaluate scikit-learn models on OpenML tasks. What other future collaborations are interesting to explore?
- fetch_openml / download_openml improvements (parquet?)
- dataset upload via parquet → coming (not yet fully supported by the Python API)
- Croissant integration in the scikit-learn OpenML data fetcher? → Tuesday morning
- provenance tracking and reproducibility for OpenML datasets (make it standard to provide a script, possibly hosted externally on GitHub or similar, to show how to reconstruct the OpenML-hosted parquet file from the original dataset format/location)
- collaborative feedback (per-dataset issue tracker) to report and discuss dataset-related problems with the dataset owner/uploader
- If fetch_openml fails, what to do?
- add support for benchmarking? E.g. benchopt
notes:
- fetch_openml
  - The ARFF parser is a headache
  - Logically, it makes sense to first download the data file locally and then load it with pandas or polars (like pandas, but in Rust)
  - scikit-learn does not load parquet right now
  - Sparse datasets are still an issue (not supported by parquet)
    - Make all sparse datasets dense and store them in parquet (they will still compress nicely). Most sparse datasets aren’t that large.
    - Some of these datasets may be one-hot-encoded datasets
  - Pyarrow is the best-supported engine (fastparquet is not). Polars can read parquet natively. Pyarrow does not support pyodide.
- Have an explanation for differences between versions
- Versions discussion
  - Versions of datasets on OpenML are very confusing
  - A version is not related to the lineage of the dataset, which is very confusing to users
  - Versions of datasets are not searchable
  - Use case: go on OpenML, search for a dataset, “which version of the dataset should I select?”
- Benchmarking
  - Interesting for probabl.ai to share models and benchmarks on OpenML?
- scikit-learn pipeline representation
  - Use the HTML widget for scikit-learn pipeline diagrams
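The download-then-load flow discussed above could be sketched roughly as follows; the cache-path helper and the file-naming scheme are hypothetical, and `pd.read_parquet` with `engine="pyarrow"` assumes pyarrow is installed:

```python
# Sketch of "download once, then load locally with pandas" for OpenML
# parquet files. The cache layout below is an illustrative assumption,
# not OpenML's actual on-disk format.
from pathlib import Path


def local_parquet_path(data_id: int, cache_dir: str = "~/.openml/cache") -> Path:
    """Hypothetical helper: where a downloaded OpenML parquet file would live."""
    return Path(cache_dir).expanduser() / f"dataset_{data_id}.parquet"


def load_dataset(data_id: int):
    """Load an OpenML dataset, preferring a locally cached parquet file."""
    path = local_parquet_path(data_id)
    if path.exists():
        import pandas as pd  # pandas delegates parquet reading to pyarrow
        return pd.read_parquet(path, engine="pyarrow")
    # Fall back to scikit-learn's fetcher, which downloads and parses for us.
    from sklearn.datasets import fetch_openml
    return fetch_openml(data_id=data_id, as_frame=True).frame
```

Polars users could swap the pandas call for `polars.read_parquet(path)`; the point is that parsing happens against a local file, not inside the fetcher.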
todo:
- openml: check whether all parquet files can be read with polars and pandas
- openml: convert all sparse datasets to dense and store them in parquet
- openml: have an explanation for differences between versions. When people upload a new version of a dataset, ask for an explanation.
- openml: sort datasets by the quality of the datasheet. Show user/datasetname/id as the name in the web UI; remove/rename “version”.
- openml website: implement a way to open an issue to contact the dataset owner
- openml: the datasheet has a section on preprocessing where people can point to a GitHub link with preprocessing code; encourage users to do this (e.g. via a dataset quality score) and allow people to report problems
- sklearn: try to load parquet files from OpenML in fetch_openml
- openml: visualization of scikit-learn pipelines (flows)
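For the pipeline-visualization item, scikit-learn already ships an HTML representation of estimators that a flow page could embed; a minimal sketch:

```python
# scikit-learn's built-in HTML diagram for estimators. The returned
# string is a self-contained HTML + CSS snippet that can be embedded
# in a web page, e.g. an OpenML flow page.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
html = estimator_html_repr(pipe)
```

The same widget is what Jupyter shows when displaying an estimator with `sklearn.set_config(display="diagram")`.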
3.5 Croissant Talk
notes:
- Croissant is a metadata description format
- ML datasets are a combination of structured and unstructured data, which makes them complicated to manage
- Croissant is built on top of schema.org and adds more detail relative to it
- The format has 4 layers:
  - dataset-level metadata
  - resource description
  - content structure
  - ML semantics
- Croissant does not require any changes to the underlying data
- Analysis and visualization tools work out of the box for all datasets
- Using Croissant, datasets can be exposed consistently across platforms
- Collaborations with Google, Hugging Face, and Google Dataset Search also exist
- OpenML has deeper dataset descriptions by default; HF and Kaggle slightly less so
- Once loaded, datasets can easily be imported elsewhere (torch, tf, etc.)
- Croissant editor: a web app where you can use a GUI to enter the dataset descriptions
- NeurIPS also now recommends using the Croissant format
- Supports the core RAI (Responsible AI) vocabulary
- For images/other files, it points to the path
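As an illustration of the four layers, a Croissant record can be sketched as a JSON-LD-shaped dict; the property names below follow the Croissant spec loosely and should be checked against it before use:

```python
# Illustrative Croissant-style metadata record showing the four layers.
# Property names and prefixes ("sc:", "cr:") are approximations of the
# spec, not a validated document.
croissant = {
    # Layer 1: dataset-level metadata
    "@type": "sc:Dataset",
    "name": "iris",
    "license": "CC-BY-4.0",
    # Layer 2: resource description (the underlying files, unchanged)
    "distribution": [
        {
            "@type": "cr:FileObject",
            "name": "iris.parquet",
            "encodingFormat": "application/x-parquet",
        }
    ],
    # Layer 3: content structure (records and typed fields)
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "records",
            "field": [{"name": "sepal_length", "dataType": "sc:Float"}],
        }
    ],
    # Layer 4: ML semantics (splits, labels, etc.) would be added here
}
```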
todo:
- integer precision and more detailed dtypes
- How are uploaded files linked to each other?
- Lineage of datasets
4. Development Tooling and Workflows
leader: Pieter
description: Automation is important to create more sustainable workloads and generally improves overall project quality. What tooling and workflows are employed in your projects to run tests, ensure code quality, help contributors, and so on? Which do you find most useful? Are there decisions you have come to regret? What are your major pain points?
What are our responsibilities as open-source projects? Should we be embracing platforms such as CodeBerg/Forgejo more?
Notes:
- Switch to open-source tools like CodeBerg once it offers more conveniences
- There is a GitHub maintainer org that you can apply to (if you are a maintainer of an important enough package) that can give you more direct access to GitHub devs/projects.
- The use of Azure workflows in scikit-learn is largely historical, but it also provides a spread over different (free) usage limits
  - GPU actions with a limited budget
- GitHub Actions workflow problems:
  - testing is a pain and not really supported (easier on Azure)
  - bad-ish documentation, but better than Azure’s
- Run the documentation examples that are linked to changes in the PR diff; use CircleCI because it can easily render the generated HTML in the browser (as opposed to GitHub, where you have to download the artifact)
- A bot that posts linter errors as a comment helped a lot
- Aiming for 100% code coverage, including all validation, though that is centralized. Disable coverage for deliberately untested parts of the code. Also test errors, types, and warnings.
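The coverage approach above could be encoded in a coverage.py configuration along these lines (a sketch, not scikit-learn’s actual configuration):

```ini
[run]
branch = True

[report]
# fail CI if overall coverage drops below the target
fail_under = 100
# patterns for lines deliberately excluded from coverage
exclude_lines =
    pragma: no cover
    raise NotImplementedError
```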
5. Probabl Product Technical Discussion
leader: Camille Troillard
description: Presentation/discussion of the Probabl technical product and potential collaboration.
- What to do to put scikit-learn in production, to make it commercially viable
- Help data scientists do better ML
  - Better understand their model’s behavior
- Build something that we’re proud of (and we’re picky)
- Let people do what they do (don’t interfere), but show interesting things along the way
- You should not require a platform, but it should be very easy to switch to a platform
- Educate people. E.g. “you’re changing the metric, but this metric doesn’t make sense.”
- Interactive dashboard that shows the results of your experiment
  - code and outputs side by side
  - Like Weights & Biases, but runs locally
  - Outputs data in a portable DB; results are registered and predictions are shown when they become available
- Unified API to the whole infrastructure stack (like Metaflow)
- Button to ‘push to production’ or ‘push to OpenML’, depending on the user
- “We’ve been spoonfed microservices in order to become addicted to CSPs”
Feedback for OpenML
- Have a clear tagline, e.g. ‘Frictionless ML resources’
- Better search interface
- Nice visualizations for the run page
- Fast website
6. Collaboration Ecosystem for Open-Source Machine Learning
leader: Lennart Purucker
description: What other open-source frameworks are struggling with the same questions we are struggling with? Should we reach out to them? Is there a need for a collaboration ecosystem in open-source machine learning/AI? What are lessons learned from which others might benefit? What are lessons learned from others from which we might benefit?
notes:
- Struggles for open source
  - Copying and learning from scikit-learn projects
    - Bots, CI/CD, CI logic
    - Helps with CI/CD setup; provide more documentation on how to set up an open-source project
  - The main issue for open source is human traffic
    - opening issues, PRs, …
- What other open-source frameworks are struggling with the same questions we are struggling with?
  - https://learn.scientific-python.org/contributors/setup/ecosystem/
  - ML backbone
    - scikit-learn, PyTorch, TensorFlow, mlr, MLJ
    - XGBoost, LightGBM, CatBoost
    - OpenML, pandas, NumPy, SciPy, Polars
  - Python backbone
    - pip / PyPI, conda, uv
    - Ray, joblib
  - ML applications / AutoML / …
    - AMLTK, auto-sklearn, FLAML, AutoGluon, H2O, …
- Should we reach out to them?
  - Company-driven open source vs. community-driven open source
    - Company-driven examples: TensorFlow, (PyTorch)
      - Internal CI vs. open-source CI
    - Community-driven: scikit-learn
      - via GitHub
- Is there a need for a collaboration ecosystem in open-source machine learning/AI?
  - Only if we have problems; otherwise it is unnecessary overhead.
  - Are they my dependencies, or am I their dependency?
- What are lessons learned from which others might benefit? / What are lessons learned from others from which we might benefit?
  - Mostly the governance documents
  - Document CI
  - Only start testing/maintaining other environments on request / when issues arise
  - https://scikit-learn.org/stable/developers/minimal_reproducer.html
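In the spirit of the linked minimal-reproducer guide: synthetic data, a fixed seed, and only the code needed to show the behaviour being reported (the estimator here is just a placeholder):

```python
# Minimal, self-contained reproducer template: small synthetic data,
# fixed random seed, and nothing beyond the code needed to trigger
# the behaviour being reported.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20)

model = LogisticRegression().fit(X, y)
score = model.score(X, y)
print(score)  # report the observed output alongside the expected one
```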
7. Academic and Industrial Scope of OpenML and Probabl in AI
leader: Lennart Purucker
Also joining: Yann Lechelle
description: Where do we see ourselves in the general field of AI/ML? Do we only cover tabular data? Are we connected to GenAI, computer vision, and NLP? What is our connection to industry applications? How do we effectively explain our position to stakeholders (who read too much about GenAI)?
- Input modalities
  - Tabular (OpenML, scikit-learn)
  - Time series
  - Vision (OpenML soon)
  - NLP
  - Graphs
  - (Other)
- Output modalities / tasks
  - Scalar regression (OpenML, scikit-learn)
  - Quantile regression (OpenML, scikit-learn)
  - Multiclass classification (OpenML, scikit-learn)
  - No-target / unsupervised / data insights (OpenML, scikit-learn)
  - Survival analysis (scikit-learn)
  - Forecasting
  - Anomaly classification
  - Anomaly detection (OpenML, scikit-learn)
  - Generative AI: structured predictions
- ML techniques in AI/ML
  - Traditional ML algorithms (SVM, RF, boosting) (OpenML, scikit-learn)
  - Traditional deep learning (OpenML)
  - Large foundation models
- What do stakeholders understand?
  - Time series
  - GenAI
Notes:
- scikit-learn’s limitation is the API definition
- probabl: “own your data science”
  - Broader scope; may include other things besides scikit-learn
    - Also deep learning and large foundation models
  - The scope is wide around open-source technology
- Can we connect OpenML to Probabl’s scope?
  - “Exporting” the API?
- LLMs “do” UX for ML
Aggregated To-Dos
OpenML
- check whether all parquet files can be read with polars and pandas
- convert all sparse datasets to dense and store them in parquet
- have an explanation for differences between versions. When people upload a new version of a dataset, ask for an explanation.
- sort datasets by the quality of the datasheet. Show user/datasetname/id as the name in the web UI; remove/rename “version”
- website: implement a way to open an issue to contact the dataset owner
- the datasheet has a section on preprocessing where people can point to a GitHub link with preprocessing code; encourage users to do this (e.g. via a dataset quality score) and allow people to report problems
- visualization of scikit-learn pipelines (flows)
- data quality plots
- Does the UX of OpenML need work?
probabl
- try to load parquet files from OpenML in fetch_openml
Croissant
- integer precision and more detailed dtypes
- How are uploaded files linked to each other?